Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrating to Google Auth Library #180

Closed
wants to merge 24 commits into from
Closed

Conversation

junpeng-jp
Copy link

@junpeng-jp junpeng-jp commented May 29, 2022

Summary of changes

  • (Migrate to google-auth from oauth2client which is now deprecated  #89) Removed all dependencies on oauth2client and transitioned to the use of google-auth
  • (Migrate to google-auth from oauth2client which is now deprecated  #89) Added lockfile storage behavior with slight modification to how saving works
    • Only authorized user credentials can be saved to backend
    • Service credential are not saved because of private key duplication risks
    • Authorized credential is loaded only once (lazily) after GoogleAuth is instantiated
    • Authorized credential is saved only once after the first consent & token exchange
    • Only credentials with refresh tokens should be saved so the backend can refresh a token if expired. A warning is raised if refresh token is not present for backwards compatibility
  • (deprecation of OAuth out-of-band (OOB) in Google OAuth #173) Deprecate command line auth with a warning and defaults to using LocalWebserverAuth w/ loopback host
  • Modified the threadsafe code to follow the recommendations by google
  • Replaced the use of decorators with lazy loading properties that handle the initialization of the client config, flow object, credentials, etc.

Potentially solves the following issue as well:

@junpeng-jp
Copy link
Author

Tests ran properly after removing some of the leading / in the yaml files
image

@junpeng-jp
Copy link
Author

@shcheklein Not sure if you've seen this

@shcheklein
Copy link
Member

@junpeng-jp hi, yes! I've seen ... I'll definitely come and review this! Thanks for making this PR 🙏

@junpeng-jp
Copy link
Author

I wasn't very thorough with adding new test cases for some of the new errors in google-auth library. Let me know if I've missed something in the code

@shcheklein
Copy link
Member

@junpeng-jp no worries, we can add tests later if needed.

First things I would be looking for are:

  • Multi threading. There was some logic around http client and thread local store to make one instance per thread. That was important to make it thread safe. We can find some PRs in the past for the context.
  • Token refresh logic. Make sure that workflow is automatic and clean. E.g. we run an API call a day after token is expired - it needs to refresh the token first (in a thread safe manner), save the new token, make the call.
  • General compatibility - list the things / workflows / APIs that we don't support anymore / we change and if there are recommend ways of covering this going forward.

Again, I'll take a look as soon as I can also.

@junpeng-jp
Copy link
Author

junpeng-jp commented May 31, 2022

For multi-threading, I've updated the library to follow the recommendation here by the authors of google-api-python-client. Change all references to self.http -> self.Get_Http_Object() which creates a new Http object for each thread. Haven't tested any actual multi-threading though

@shcheklein
Copy link
Member

Change all references to self.http -> self.Get_Http_Object() which creates a new Http object for each thread.

does it mean it creates and authorizes a new object per every request? It might mean that we won't have the connections pool, and in general it's more expensive.

One more question - is refresh token logic now handled in the Authorize call? does it save back the new token into the file?

@junpeng-jp
Copy link
Author

junpeng-jp commented Jun 5, 2022

does it mean it creates and authorizes a new object per every request? It might mean that we won't have the connections pool, and in general it's more expensive.

I'm kind of new to multiprocessing in general. I've largely referenced this old article & the previous linked documentation by google on how to share credentials across threads

  1. Create a Credential object that is shared across all threads (do auth & consent once)
  2. For each thread, a new HTTP object is created referencing this shared Credential object
  3. Refresh requests are made automatically in the request method of the AuthorizedHttp object
    • when the Http object makes a request, it first checks whether the shared credentials have expired
    • if expired, gets the shared Credentials to refresh itself (this is considered a synchronous refresh)
    • else, it simply uses the shared credentials to adds the http headers to this request via the self.credentials.before_request(...) (docs here)

One more question - is refresh token logic now handled in the Authorize call? does it save back the new token into the file?

There's no need to manually refresh credentials anymore if i'm not wrong. The in-memory Credential objects would have the latest token & shared across all threads and their AuthorizedHttp objects. Refresh requests are made automatically by the AuthorizedHttp / AuthorizedSession / any other transport related objects in google-auth library whenever they're making a request to the API (usually in the request method or something similar).

If backend is specified, credentials are saved only once after the user provides consent & exchanges the access token + refresh token. A new Credential object can be created from the file if the in memory Credentials were deleted. This new credential is treated as an expired credential & will automatically refresh by the Http / Session objects using the saved refresh token

@shcheklein
Copy link
Member

For each thread, a new HTTP object is created referencing this shared Credential object

Yep. But ideally it's cached per thread (it's shared across multiple requests within the same thread). There was logic with thread local:

https://github.com/iterative/PyDrive2/blob/master/pydrive2/auth.py#L78-L80

There's no need to manually refresh credentials anymore if i'm not wrong.

Got it. Need to check since previously it was saving it back on each refresh, I wonder what is the difference here (clearly it means that it will be running refresh every time you restart the script, but are there any other consequences?)

self.auth.Authorize()

# Ensure that a thread-safe HTTP object is provided.
if (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like providing a custom http into a request won't work anymore. It might be needed to scenarios like proxy servers, etc. When users take responsibility in providing and managing HTTP clients.

Copy link
Author

@junpeng-jp junpeng-jp Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you share some examples of how this decorator can be used to provide a custom http into a request? I'm not really familiar with the above scenarios

Regarding the previous discussion on caching of Http object per thread, I will add thread_local back to the auth object now that I understand its purpose

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@junpeng-jp any progress on that?

Copy link
Author

@junpeng-jp junpeng-jp Jul 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added this as part of a new lazily-loaded property called authorized_http. It would create a AuthorizedHttp object and add it to self.thread_local if it doesn't exist. Else, just retrieve it from self.thread_local. Tagged you below

pydrive2/auth.py Outdated
@@ -147,20 +65,6 @@ class GoogleAuth(ApiAttributeMixin):
and authorization.
"""

DEFAULT_SETTINGS = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: why do we move the outside the class? (it will break some project that rely on this to be inside)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't really expect anybody to use this outside of the GoogleAuth constructor. I can definitely add it back as a class static variable

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't use it anymore in DVC, still let's not move things in this PR, let's focus on the migration first, please

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reverted the DEFAULT_SETTINGS to a class static variable

pydrive2/auth.py Outdated Show resolved Hide resolved
@CheckAuth
if self.storage:
self.SaveCredentials()

def CommandLineAuth(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide to replace it here with LocalWebServer (we are changing it's behavior) - it's better to get rid of it completely (and we will bump the major version)

Copy link
Author

@junpeng-jp junpeng-jp Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to understand what are the use-cases of loading credentials from a backend. For now, I imagine only the following use-case:

App wants to cache the user's token and refresh token so that the next time he returns, the App can simply identify the user, retrieve the authorized user credentials from backend & silently refresh the token without a need for the User to re-auth & exchange a new token

In this case, you would:

  • Save the authorized user credentials after the first authentication store the user's identity & refresh token for silent re-auth
  • Whenever a new session starts, identify the user, retrieve authorized credentials from backend, and create a new GoogleAuth object

Copy link
Author

@junpeng-jp junpeng-jp Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For service auth use-cases, I can't think of a reason to duplicate your service account private key elsewhere besides the first known location of this file

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to understand what are the use-cases of loading credentials from a backend.

hmm, I'm not sure I understand how this question is relevant to the comment I made, could you clarify please - My point is that it's a good opportunity to completely get rid of the CommandLineAuth, wdyt?

Could you clarify your comments please?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think i might have misread this somehow. Sure, I think we can get rid of CommandLineAuth. Should I raise a deprecation Warning or just delete it outright?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can raise "NotImplemented" or something with an explanation, sounds good to me

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've raised a DeprecationWarning Error which tells the users to use the local webserver auth with a loopback address

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still, let's not fall back to the serve implementation. It's better to raise an exception in this case. Changing behavior this way can be unexpected / lead to some unintended consequences.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, it will be raising an exception here

@shcheklein
Copy link
Member

@junpeng-jp sorry for the delay, I've been busy, but this PR is important and I'll get to it soon.

@lappazos
Copy link

@shcheklein @junpeng-jp i need this PR, any idea when will it be merged?

@shcheklein
Copy link
Member

@lappazos sorry, it's on me. I need to come back and do the next iteration of reviewing it.

and "http" in kwargs["param"]
and kwargs["param"]["http"] is not None
):
self.http = kwargs["param"]["http"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are still this logic, I think the way it worked was something like file.Upload(param={http=...})

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add back this functionality

pydrive2/auth.py Outdated
@property
def authorized_http(self):
# Ensure that a thread-safe, Authorized HTTP object is provided
# If HTTP object not specified, create or resuse an HTTP
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment is outdated / copy-pasted ... HTTP object could not be provided here

pydrive2/auth.py Outdated
# The Installed App Flow raises OSError when binding to a used port

# More on TCP state changes: https://users.cs.northwestern.edu/~agupta/cs340/project2/TCPIP_State_Transition_Diagram.pdf
if e.errno not in (10048, 48):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I still don't understand this logic - my understanding is that we can just try another port in case of an error (any OSError).

except OSError:
                pass

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohhhh yeahh. There's no need for me to specifically catch these 2 error codes. Will update this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


# in-case reauth / consent required
# create a flow object so that reauth flow can be triggered
additional_config = {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, don't quite understand this. How was this handled before?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the whole create a flow object so that reauth flow can be triggered part

Copy link
Author

@junpeng-jp junpeng-jp Jul 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the past, self.flow was populated using self.GetFlow() as part of the CheckAuth decorator, which handles the whole refresh process. Now that refresh is handled by the google_auth lib, I've opted to create the Flow object within the self.LoadCredentialsFile() using the recently loaded credentials

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but you also have that lazy loading property flow that is being used internally, or do you mean that we have to create it again in certain cases?

also, btw, I think flow() now duplicates GetFlow which is not nice.

@@ -75,6 +75,7 @@
"client_service_email": {"type": str, "required": False},
"client_pkcs12_file_path": {"type": str, "required": False},
"client_json_file_path": {"type": str, "required": False},
"use_default": {"type": bool, "required": False},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we skip for the sake of this PR the default credentials auth? I think it should be done outside of the ServiceAuth + probably we should be using a different setting for this (don't know which one yet)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll shift this into a subsequent PR once we've migrated to new google auth library

@junpeng-jp
Copy link
Author

junpeng-jp commented Jul 31, 2022

@shcheklein I realised that there's quite still a lot of issues with this PR that needs fixing but I was wondering if it's okay for me to close this PR & raise a new one? There's been a large number of new commits over the 1month gap that I'm having trouble resolving merge conflicts

I'll start threads on the unresolved comments above once i've raised a new PR

@shcheklein
Copy link
Member

@junpeng-jp

it's okay for me to close this PR & raise a new one

Yes, it'a absolutely okay, nw. Thanks for doing this!

@junpeng-jp
Copy link
Author

junpeng-jp commented Aug 6, 2022

Closing this PR in favor of a new one @ #221. Will be raising the pull request soon

@junpeng-jp junpeng-jp closed this Aug 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants